Peeter Piksarv (piksarv .at. gmail.com)
The latest version of this Jupyter notebook is available at https://github.com/ppik/playdata/tree/master/Kaggle-Expedia
This is my take on the Expedia Hotel Recommendations Kaggle competition, started off using the Dataquest tutorial by Vik Paruchuri.
In [1]:
import itertools
import operator
import random
import matplotlib.pyplot as plt
import ml_metrics as metrics
import numpy as np
import pandas as pd
import sklearn
import sklearn.cross_validation  # used below for cross_val_score and KFold
import sklearn.decomposition
import sklearn.ensemble
%matplotlib notebook
There's actually no need to unpack the gzipped csv files: pandas' read_csv can handle them directly, although it can be slower (reading 1,000,000 rows from train.csv.gz seems to be about 9% slower than from train.csv on my laptop).
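For reference, a minimal sketch of how such a timing comparison could be done (assuming both data/train.csv and data/train.csv.gz are present; the repeat count is just illustrative):

import timeit

# Read 1,000,000 rows from the plain and the gzipped file;
# pandas infers the compression from the file extension.
for fn in ['data/train.csv', 'data/train.csv.gz']:
    t = timeit.timeit(lambda: pd.read_csv(fn, nrows=1000000), number=3)
    print(fn, t / 3)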
Additionally, it's a good idea to specify the data types for each column to ease the memory requirements. By default pandas detects the following data types:
In [2]:
train = pd.read_csv('data/train.csv.gz', nrows=10)
train.info()
According to the specification, the data fields are as follows:
train.csv
Column name | Description | Data type | Equiv. type | Notes |
---|---|---|---|---|
date_time | Timestamp | string | [1] | |
site_name | ID of the Expedia point of sale | int | np.int32 | |
posa_continent | ID of continent associated with site_name | int | np.int32 | |
user_location_country | The ID of the country the customer is located in | int | np.int32 | |
user_location_region | The ID of the region the customer is located in | int | np.int32 | |
user_location_city | The ID of the city the customer is located in | int | np.int32 | |
orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double | np.float64 | |
user_id | ID of user | int | np.int32 | |
is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint | np.uint8 | [2] |
is_package | 1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise | int | np.uint8 | [2] |
channel | ID of a marketing channel | int | np.int32 | |
srch_ci | Checkin date | string | [1] | |
srch_co | Checkout date | string | [1] | |
srch_adults_cnt | The number of adults specified in the hotel room | int | np.int32 | |
srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int | np.int32 | [4] |
srch_rm_cnt | The number of hotel rooms specified in the search | int | np.int32 | [4] |
srch_destination_id | ID of the destination where the hotel search was performed | int | np.int32 | |
srch_destination_type_id | Type of destination | int | np.int32 | |
hotel_continent | Hotel continent | int | np.int32 | |
hotel_country | Hotel country | int | np.int32 | |
hotel_market | Hotel market | int | np.int32 | |
is_booking | 1 if a booking, 0 if a click | tinyint | np.uint8 | [2] |
cnt | Number of similar events in the context of the same user session | bigint | np.int64 | |
hotel_cluster | ID of a hotel cluster | int | np.int32 | |
destinations.csv
Column name | Description | Data type | Equiv. type | Notes |
---|---|---|---|---|
srch_destination_id | ID of the destination where the hotel search was performed | int | np.int32 | |
d1-d149 | latent description of search regions | double | np.float64 | [3,5] |
In [2]:
traincols = ['date_time', 'site_name', 'posa_continent', 'user_location_country',
'user_location_region', 'user_location_city', 'orig_destination_distance',
'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
'srch_destination_type_id', 'is_booking', 'cnt', 'hotel_continent',
'hotel_country', 'hotel_market', 'hotel_cluster']
testcols = ['id', 'date_time', 'site_name', 'posa_continent', 'user_location_country',
'user_location_region', 'user_location_city', 'orig_destination_distance',
'user_id', 'is_mobile', 'is_package', 'channel', 'srch_ci', 'srch_co',
'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'srch_destination_id',
'srch_destination_type_id', 'hotel_continent', 'hotel_country', 'hotel_market']
Finding columns in testcols but not in traincols and vice versa:
In [4]:
[col for col in testcols if col not in traincols]
Out[4]:
['id']
In [5]:
[col for col in traincols if col not in testcols]
Out[5]:
['is_booking', 'cnt', 'hotel_cluster']
I don't know exactly which data columns I will eventually be using, but I will define the data types for them here anyway, just in case. Looking at the data, most of the columns are actually non-negative integers, so I can use unsigned integers in most cases. The choice between uint8, uint16, uint32, and others was determined by the min and max values in the test dataset.
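As a sanity check for these dtype choices, the observed min and max of each numeric column can be inspected on a sample read with pandas' defaults (a sketch; a full scan could be done the same way in chunks):

sample = pd.read_csv('data/test.csv.gz', nrows=1000000)
sample.describe().loc[['min', 'max']].T  # observed ranges per numeric column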
In [3]:
def read_csv(filename, cols, nrows=None):
datecols = ['date_time', 'srch_ci', 'srch_co']
dateparser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S', errors='coerce')
dtypes = {
'id': np.uint32,
'site_name': np.uint8,
'posa_continent': np.uint8,
'user_location_country': np.uint16,
'user_location_region': np.uint16,
'user_location_city': np.uint16,
'orig_destination_distance': np.float32,
'user_id': np.uint32,
'is_mobile': bool,
'is_package': bool,
'channel': np.uint8,
'srch_adults_cnt': np.uint8,
'srch_children_cnt': np.uint8,
'srch_rm_cnt': np.uint8,
'srch_destination_id': np.uint32,
'srch_destination_type_id': np.uint8,
'is_booking': bool,
'cnt': np.uint64,
'hotel_continent': np.uint8,
'hotel_country': np.uint16,
'hotel_market': np.uint16,
'hotel_cluster': np.uint8,
}
df = pd.read_csv(
filename,
nrows=nrows,
usecols=cols,
dtype=dtypes, # dtype can also specify data types for columns that do not exist in the particular data file
parse_dates=[col for col in datecols if col in cols], # columns here must be also in usecols
date_parser=dateparser,
)
return df
In [5]:
train = read_csv('data/train.csv.gz', nrows=None, cols=traincols)
train.info()
With these type definitions the entire training set of 37 million entries takes 2.3 GB of memory.
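The figure can be double-checked explicitly (deep=True would also count the contents of any object-dtype columns):

train.memory_usage(deep=True).sum() / 2**30  # total size in GiB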
In [6]:
test = read_csv('data/test.csv.gz', cols=testcols)
test.info()
Finding missing values in test data:
In [9]:
test.isnull().sum()
Out[9]:
There are also some entries where the check-in date is later than the check-out date:
In [10]:
(test.srch_ci > test.srch_co).sum()
Out[10]:
Checking that all of the user_ids in the test set are contained in the training set:
In [7]:
test_ids = set(test.user_id.unique())
train_ids = set(train.user_id.unique())
test_ids <= train_ids # issubset
Out[7]:
True
However, not all user_ids that are in the training data appear in the test data:
In [12]:
len(train_ids - test_ids)
Out[12]:
Extract month and year fields from the date:
In [8]:
train['month'] = train['date_time'].dt.month.astype(np.uint8)
train['year'] = train['date_time'].dt.year.astype(np.uint16)
Pick 10,000 users for smaller-scale testing:
In [12]:
sel_user_ids = sorted(random.sample(train_ids, 10000))
sel_train = train[train.user_id.isin(sel_user_ids)]
Create new test and training sets
In [13]:
t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]
Remove click events from t2, as in the original test data.
In [14]:
t2 = t2[t2.is_booking == True]
In [57]:
most_common_clusters = list(train.hotel_cluster.value_counts().head().index)
Predicting most_common_clusters for every single row in the selected test data.
In [13]:
predictions = [most_common_clusters for i in range(len(t2))]
Calculating Mean Average Precision with mapk from ml_metrics.
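As a toy illustration of what mapk returns (not competition data): with a single query whose true cluster is 1 and a prediction list where 1 shows up at rank 2, the average precision at k=5 is 1/2.

metrics.mapk([[1]], [[0, 1, 2]], k=5)  # first hit at rank 2 -> 0.5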
In [14]:
target = [[l] for l in t2['hotel_cluster']]
metrics.mapk(target, predictions, k=5)
Out[14]:
That's not too great.
In [20]:
#train.corr()['hotel_cluster']
# Calculating the correlations takes a while, and no linear correlations were found in the tutorial anyway.
In [25]:
dest = pd.read_csv('data/destinations.csv.gz')
dest.info()
In [26]:
dest.head()
Out[26]:
In [27]:
pca = sklearn.decomposition.PCA(n_components=3)
dest_small = pca.fit_transform(dest[['d{}'.format(i) for i in range(1,150)]])
dest_small = pd.DataFrame(dest_small)
dest_small['srch_destination_id'] = dest['srch_destination_id']
In [46]:
dest_small.head()
Out[46]:
The fraction of variance retained by principal component analysis with 3 components:
In [49]:
sum(pca.explained_variance_ratio_)
Out[49]:
The following function generates new date features based on date_time, srch_ci, and srch_co, drops the non-numeric date columns themselves, and adds in the features from dest_small. The tutorial replaces any missing values with -1. (I initially planned to use unsigned integers for most of the variables; using -1 as a fill value would not work then. May test replacing NAs with the most common values instead.)
In [62]:
def calc_fast_features(df):
# Assumes that the date_time, srch_ci, and srch_co columns have already been converted to datetime.
props = {}
for prop in ['month', 'day', 'hour', 'minute', 'dayofweek', 'quarter']:
props[prop] = getattr(df['date_time'].dt, prop)
carryover = [p for p in df.columns if p not in ['date_time', 'srch_ci', 'srch_co']]
for prop in carryover:
props[prop] = df[prop]
date_props = ['month', 'day', 'dayofweek', 'quarter']
for prop in date_props:
props['ci_{}'.format(prop)] = getattr(df['srch_ci'].dt, prop)
props['co_{}'.format(prop)] = getattr(df['srch_co'].dt, prop)
props['stay_span'] = (df['srch_co'] - df['srch_ci']).astype('timedelta64[h]')
ret = pd.DataFrame(props)
ret = ret.join(dest_small, on='srch_destination_id', how='left', rsuffix='dest')
ret = ret.drop('srch_destination_iddest', axis=1)
return ret
In [63]:
df = calc_fast_features(t1)
Using mean values to fill missing data.
In [74]:
df = df.fillna(df.mean())
In [82]:
predictors = [c for c in df.columns if c not in ['hotel_cluster']]
clf = sklearn.ensemble.RandomForestClassifier(
n_estimators=10,
min_weight_fraction_leaf=0.1,
)
scores = sklearn.cross_validation.cross_val_score(
clf,
df[predictors],
df['hotel_cluster'],
cv=5,
)
scores
Out[82]:
Classifier accuracy seems rather low here as well.
In [107]:
all_probs = []
unique_clusters = df['hotel_cluster'].unique()
for cluster in unique_clusters:
df['target'] = 0
df.loc[df['hotel_cluster'] == cluster, 'target'] = 1
predictors = [c for c in df.columns if c not in ['hotel_cluster', 'target']]
probs = []
cv = sklearn.cross_validation.KFold(len(df), n_folds=5)
clf = sklearn.ensemble.RandomForestClassifier(
n_estimators=10,
min_weight_fraction_leaf=0.1,
)
for i, (tr, te) in enumerate(cv):
clf.fit(df[predictors].iloc[tr], df['target'].iloc[tr])
preds = clf.predict_proba(df[predictors].iloc[te])
probs.append([p[1] for p in preds]) # materialize now; a generator would only be consumed after preds is rebound
full_probs = itertools.chain.from_iterable(probs)
all_probs.append(list(full_probs))
prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters
def find_top5(row):
return list(row.nlargest(5).index)
preds = []
for index, row in prediction_frame.iterrows():
preds.append(find_top5(row))
metrics.mapk([[l] for l in t2['hotel_cluster']], preds, k=5)
Out[107]:
Using just the most popular clusters gives better scores, so the approach here isn't particularly promising. One thing to note is that the input is full of categorical features, so to properly apply machine learning, converting those values to separate binary features may be a more appropriate approach.
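For example, pandas can expand categorical columns into binary indicator columns; a sketch on two arbitrarily picked columns (doing this for high-cardinality IDs would blow up the feature count):

dummies = pd.get_dummies(t1[['site_name', 'srch_destination_type_id']],
                         columns=['site_name', 'srch_destination_type_id'])
dummies.head()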
In [16]:
def make_key(items):
return '_'.join([str(i) for i in items])
In [71]:
match_cols = ['srch_destination_id']
cluster_cols = match_cols + ['hotel_cluster']
groups = t1.groupby(cluster_cols)
In [73]:
top_clusters = {}
for name, group in groups:
bookings = group['is_booking'].sum()
clicks = len(group) - bookings
score = bookings + .15*clicks
clus_name = make_key(name[:len(match_cols)])
if clus_name not in top_clusters:
top_clusters[clus_name] = {}
top_clusters[clus_name][name[-1]] = score
This dictionary is keyed by srch_destination_id, and each value is another dictionary with hotel clusters as keys and scores as values.
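For example, peeking at a single entry shows the structure (the actual IDs and scores will differ):

next(iter(top_clusters.items()))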
Finding the top 5 for each destination.
In [19]:
cluster_dict = {}
for n in top_clusters:
tc = top_clusters[n]
top = [l[0] for l in sorted(tc.items(), key=operator.itemgetter(1), reverse=True)[:5]]
cluster_dict[n] = top
In [20]:
preds = []
for index, row in t2.iterrows():
key = make_key([row[m] for m in match_cols])
if key in cluster_dict:
preds.append(cluster_dict[key])
else:
preds.append(most_common_clusters)
metrics.mapk([[l] for l in t2["hotel_cluster"]], preds, k=5)
Out[20]:
In [ ]:
cluster_dict
In [41]:
match_cols = [
'user_location_country',
'user_location_region',
'user_location_city',
'hotel_market',
'orig_destination_distance',
]
groups = t1.groupby(match_cols)
def generate_exact_matches(row, match_cols):
index = tuple(row[t] for t in match_cols)
try:
group = groups.get_group(index)
except KeyError:
return []
clus = list(set(group.hotel_cluster))
return clus
exact_matches = []
for i in range(t2.shape[0]):
exact_matches.append(generate_exact_matches(t2.iloc[i], match_cols))
In [43]:
def f5(seq, idfun=None):
"""Uniquify a list by Peter Bengtsson
https://www.peterbe.com/plog/uniqifiers-benchmark
"""
if idfun is None:
def idfun(x):
return x
seen = {}
result = []
for item in seq:
marker = idfun(item)
if marker in seen:
continue
seen[marker] = 1
result.append(item)
return result
In [44]:
full_preds = [
f5(exact_matches[p] + preds[p] + most_common_clusters)[:5]
for p
in range(len(preds))
]
metrics.mapk([[l] for l in t2["hotel_cluster"]], full_preds, k=5)
Out[44]:
In [56]:
write_p = [" ".join([str(l) for l in p]) for p in full_preds]
write_frame = ["{},{}".format(t2.index[i], write_p[i]) for i in range(len(full_preds))]
write_frame = ["id,hotel_clusters"] + write_frame
with open('predictions.csv', 'w+') as f:
f.write('\n'.join(write_frame))
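A sketch of an equivalent way to write the same file with pandas (assuming, as above, that t2.index serves as the id column):

sub = pd.DataFrame({'id': t2.index, 'hotel_cluster': write_p},
                   columns=['id', 'hotel_cluster'])
sub.to_csv('predictions.csv', index=False)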
-- Peeter Piksarv